Universal parameters of bulk-solvent masks

The optimal choice of bulk-solvent mask parameters (grid step, and solvent and shrinkage radii) has been revised.


Introduction
Bulk solvent (or disordered solvent) on average occupies about half the volume of a macromolecular crystal and noticeably contributes to the medium-and low-resolution structure factor intensities [see e.g.Weichenberger et al. (2015) for a recent review].It is therefore important to include its contribution into the model-calculated structure factors to account for the entire unit-cell contents adequately.The procedure needs to be fast and accurate because this calculation is repeated many times during atomic model refinement.
The flat (or mask-based) bulk-solvent model (Jiang & Bru ¨nger, 1994) is currently the option of choice in most crystallographic software packages.The model first requires the definition of a solvent mask in the unit cell.This mask is a binary function calculated on a regular grid with values of zero inside the molecular region and one outside.The Fourier coefficients F mask (s) of this binary mask are then calculated and scaled together with the structure factors F calc (s) calculated from the atomic model, The resolution-dependent scales k mask (s) and k total (s) are obtained by fitting F model (s) to the experimental data [see e.g.Afonine et al. (2013)].
The mask calculation as introduced by Jiang & Bru ¨nger (1994) uses the following parameters: (i) The size of the grid step d grid .
(ii) The solvent probe radius r probe or r solv .
(iii) The shrinking radius r shrink .
(iv) Tabulated atomic van der Waals radii.The mask calculation procedure involves augmenting the atomic van der Waals radius with the solvent probe radius to create a sphere of combined radius around each atom.Grid points falling outside of these spheres, which define the expanded macromolecular region, are designated as the solvent-accessible region surrounding the macromolecule.Subsequently, all grid points within a distance r shrink from any point of the tentative solvent-accessible region defined above are assigned to the bulk-solvent region.The resulting mask is referred to as the bulk-solvent mask.
An optimal choice for these parameters should balance structure factor accuracy and the time required to compute the mask and its Fourier coefficients by Fourier transformation.Based on two cases at 2.2 and 1.8 A ˚resolution, Jiang & Bru ¨nger (1994) determined optimal choices for r probe , r shrink and d step to be 1.0, 1.1 and (d min /4) A ˚, respectively, where d min is the resolution of the data set.Later, Rees et al. (2005) showed that for low-resolution data sets a step size of (d min /4) A ˚is too coarse, leading to inaccurate masks, and that a step size somewhere between d min /5 and d min /10 is more appropriate.While this solves the problem of mask accuracy at low resolution, at high resolution such a fine grid step will result in a significant (and unnecessary) increase in computational time.
In this work, we suggest that the grid step for mask calculation should not depend on the resolution.We demonstrate that using a step size of 0.6 A ˚, along with values of r solv and r shrink set to 1.1 and 0.9 A ˚, respectively, does not compromise the accuracy of the mask or the calculation time.Therefore, we recommend this combination of parameters for calculating the bulk-solvent mask for structures of any resolution.

Why is a common resolution-independent grid step expected?
The electron-density distribution of a macromolecule in a crystal is a peaky function, while the function that describes the bulk solvent is a flat function with a smooth border [see e.g.Fenn et al. (2010)].Consequently, the Fourier coefficients that describe the bulk-solvent distribution decrease sharply, and usually become negligibly small, at resolutions better than d solv ' 3.5-4.0A ˚[see e.g.Phillips (1980), Jiang & Bru ¨nger (1994) or Afonine et al. (2013)].To understand the consequences of this on the choice of the grid step, we refer to a one-dimensional example below.
For a periodic function of a single variable with period a, the integral Fourier transform gives an infinite number of Fourier coefficients F(h), where h is an integer, both positive and negative.When such a function is sampled on a regular grid with N points per interval, the discrete Fourier transform yields only N independent values F grid (h), e.g.  2) for three-dimensional functions can be found in Sayre (1951), Lunin et al. (2002), Navaza (2002) and Afonine & Urzhumtsev (2004).This suggests the potential existence of a universally optimal grid step d 0 grid for the problem under study, which is related to d solv ' 3.5 A ˚in a manner similar to (3), albeit with a scale factor that may not be equal to 1 2 ; the latter arises from the facts that these high-resolution structure factors may be different from exactly zero and the Fourier analysis is carried out in three-dimensional space.

Models and data
The search for the optimal bulk-solvent mask parameters was conducted using 277 quality-filtered models and X-ray diffraction data obtained from the Protein Data Bank (PDB; Burley et al., 2021).The quality filters included a crystallographic R factor better than 0.25, overall and per-resolution shell data completeness above 95%, no data pathologies such as twinning, and a relatively high upper data resolution limit (d min � 3.0 A ˚).To accelerate the calculations, we excluded very large models.The results obtained with these 277 models were then validated with a larger set of 2077 structures used in a previous bulk-solvent study (Afonine et al., 2013) and representing a broad range of model sizes and data resolutions, from subatomic to low.

Finding optimal values for r solv , r shrink and d grid
For each selected PDB entry, the values of d grid were systematically sampled in the range between 0.2 and 1 A ˚in steps of 0.1 A ˚, and both r solv and r shrink were sampled between 0.5 and 1.5 A ˚, also in steps of 0.1 A ˚.For each trial triplet of values (r solv , r shrink , d grid ), the scales k total (s) and k mask (s) in (1) were re-calculated as detailed by Afonine et al. (2013), and the R-factor values, referred to as R (4) , were calculated using all reflections up to 4 A ˚resolution.In what follows, R ð4Þ n stands for R (4) calculated for the structure numbered n.These values were the principal information to identify potential universal values for r 0 solv , r 0 shrink and d 0 grid .The details follow in Section 3.

Variation of the optimal R factor with the mask grid step
The search for common parameters was based on the hypothesis that there is a common behaviour of R ð4Þ n with parameter values for different structures.First, we tried to decouple the search for optimal d grid and (r solv , r shrink ).This was achieved by finding the combination of (r solv , r shrink ) that minimizes R ð4Þ n for each trial d grid and for each structure n, � R Obviously, these values are different for each structure, but they vary with d grid very similarly.In particular, this is true for their variation around the average of � R ð4Þ n ðd grid Þ over grid steps Subtracting the average in (5) makes the dependency of � � R ð4Þ n ðd grid Þ on d grid similar for all structures, which in turn makes it possible to analyse the average of these dependencies over all structures (Fig. 1).
It is to be expected that � � R ð4Þ n ðd grid Þ should increase with step size.However, the value does not change significantly for d grid in the range of 0.2-0.4A ˚, suggesting that steps smaller than 0.4 A ˚are unnecessarily small.Above this step value, � � R ð4Þ n ðd grid Þ starts to increase, and the goal is to find a compromise between the introduced errors and the gain in computation time.Increasing the step from 0.4 A ˚to 0.6 A ˚or 0.8 A ˚increases the grid size, and therefore the number of computing operations, by about four times or eight times, respectively.
A step size of 1.0 A ˚resulted in very large errors and was excluded from further analysis.Calculations with a step size of 0.9 A ˚resulted in a large number of outliers with large � � R ð4Þ n , making this step size also unsuitable.This leads to 0.4-0.8A ˚as the range for the grid-step search.

Acceptable combinations of parameters
Next, for each model n, we analysed the parameter values that lead to the lowest R ð4Þ n value across all combinations of (r solv , r shrink , d grid ), It is possible that, for a given structure, several combinations of (r solv , r shrink , d grid ) result in R ð4Þ n values that are close to the global minimum of (6).To address such small fluctuations in the R ð4Þ n values, we introduce a parameter " R considering all values R ð4Þ n ðr solv ; r shrink ; n , where the value of " R varies in the range 0.001-0.002.
The parameter values (r solv , r shrink , d grid ) corresponding to (6) are expected to vary from one structure to another, and we are looking for the combinations that are persistent over all structures.As a formal quantitative measure, for each set of parameters (r solv , r shrink , d grid ) and for each structure n, we calculate a non-negative value To be able to combine the distribution of (r solv , r shrink , d grid ) for each structure into one cumulative distribution over all structures, we convert �R ð4Þ n in (7) into P n r solv ; r shrink ; with constants C > 0 and " R .The product of (8) over all structures, P r solv ; r shrink ; reflects both the contrast of the lowest R ð4Þ n for an individual structure and the persistence of the parameter values over all structures.P(r solv , r shrink , d grid ) varies from 0 to 1; the higher the value of ( 9), the more preferable the combination of parameters.Calculations with data sets of different sizes suggested the choice of C in the range between 0.01 and 1.0 and " R as stated above in order to obtain a decent contrast while keeping points with neighbouring R ð4Þ n values.In general, we observed that the variation in the constants C and " R around the values given above obviously modifies the contrast of the distribution (9) while not influencing the location of its peaks.
Fig. 2 shows the results with " R = 0.002 and C = 0.5.As expected, the distribution shows that mask shrinking does not impact the results when r shrink < d grid (rectangular bottom-left regions).Also, the distribution shows a clear nearly linear correlation between r solv and r shrink , giving preferable values roughly at the line r solv À r shrink ' 0.2 A ˚for each grid step.
Finally, it shows that the optimal radii (r solv , r shrink ) cluster in the range (1.1 � 0.1 A ˚, 0.9 � 0.1 A ˚).
As expected, increasing the grid step makes R ð4Þ n worse.In agreement with the first test (Fig. 1), most frequently the lowest R ð4Þ n occurred for a grid step size of 0.3-0.4A ˚(P values up to 0.99).Consequently, a step of 0.4 A ˚may be considered as a candidate for the most accurate calculations since a smaller step of 0.3 A ˚does not significantly improve the R factors while leading to an increased computational time.Using a grid with step sizes of 0.5-0.6A ˚makes it possible for R ð4Þ n to be close to � R 2), light green (0.2 � P < 0.9) and dark green (P � 0.9).d grid = 0.6 A ˚is a good potential compromise candidate for the universal value.
The values of the radii (r solv , r shrink ) leading to the minimum � R ð4Þ n value in (6) also varied only slightly over the trial grid step sizes.This provided a relatively small number of tentative combinations (r solv , r shrink , d grid ) to identify the optimal ones which we denote ðr 0 solv ; r 0 shrink ; d 0 grid Þ.

Optimal set of parameters
The analysis described in Sections 3.1-3.2results in a range of (r solv , r shrink , d grid ) parameters minimizing R (4) on average for all test models.These values, however, do not necessarily lead to the lowest R (4) for a particular model.
Next, we can ask which of these combinations, if any, lead to a value of R ð4Þ n ðr solv ; r shrink ; d grid Þ that is larger than, and by how much, the lowest value of � R ð4Þ n in ( 6) for what fraction of structures.To answer this question, we calculate the fraction p(�R) of the structures with the difference for different �R values.The expected solution corresponds to the minimum of the p(�R) function.For the sake of The fraction of the models that satisfy R ð4Þ n ðr solv ; r shrink ; n > �R shown as a function of �R.The colour scheme for the grid step is the same for all plots and is given in the top right plot.The values of ðr solv ; r shrink Þ are indicated in each plot.completeness, we calculated p(�R) for all triplet values of the parameters considered above.
The combination (r solv = 1.0A ˚, r shrink = 1.0A ˚) gives poor results for all grid step sizes and the combination (r solv = 1.2A ˚, r shrink = 0.8 A ˚) gives results acceptable only for very small grid steps, d grid = 0.3-0.4A ˚(Fig. 3).The best results are observed for the sets of r solv , r shrink with r solv À r shrink = 0.2 A ˚, with a slight preference for the combination (r solv = 1.1 A ˚, r shrink = 0.9 A ˚, d grid = 0.6 A ˚). Here, R ð4Þ n increased by less than 0.5% for all structures of the search set except one, for which this value was below 0.6%.The same plots (Fig. 3) indicate that if more accurate calculations are required, then the combination (r solv = 1.1 A ˚, r shrink = 0.8 A ˚, d grid = 0.4 A ˚) is optimal.However, using this finer grid step would lead to a nearly fourfold increase in the number of grid points and, consequently, in the number of computational operations required.Conversely, for a very large structure, if a coarser grid is acceptable, a possible combination would be (r solv = 1.0A ˚, r shrink = 0.8 A ˚, d grid = 0.7 A ˚), resulting in only a slight increase in overall R factors.

New versus old mask calculation parameters
Finally, for each model from the complete data set, we analysed how much the R (4) and the R factor calculated using all reflection data change if the new mask calculation parameter values (r 0 solv = 1.1 A ˚, r 0 shrink = 0.9 A ˚, d 0 grid = 0.6 A ˚) are used instead of the values of (r solv = 1.1 A ˚, r shrink = 1.0A ˚, d grid = d min /4) used previously.
Fig. 4 shows that R ð4Þ n varies little and typically remains within � 0.3% for most structures, with the exception of a few cases where it varies within � 0.5%.We consider these variations negligible.As expected, R n changes even less than R ð4Þ n , since the bulk-solvent contribution vanishes beyond 3-4 A ˚resolution.
The only notable outlier is PDB entry 3b6a (Willems et al., 2008), for which �R ð4Þ n = 0.92% (�R n = 0.64%).This structure was solved at a resolution of d min = 3.0 A ˚, which means that the original algorithm used a mask with a grid step size of 0.75 A ˚, coarser than the proposed 0.6 A ˚.This seemingly counterintuitive result can be rationalized as follows.The bulk-solvent mask typically consists of a large region and several (often many) smaller isolated regions (Afonine et al., 2024).These small regions are typically cavities inside the protein or computational artefacts.The number and size of such regions vary based upon the choice of mask parameters (r solv , r shrink , d grid ).With a step size d 0 grid = 0.6 A ˚, the mask for 3b6a contains about 20 isolated small regions incapable of containing even a single disordered water molecule.Excluding these regions from the bulk-solvent mask reduces �R ð4Þ n and �R n to 0.36% and 0.26%, respectively, suggesting that these regions are computational artefacts.

Conclusions
The choice of mask parameters for the flat bulk-solvent model, i.e. solvent and mask shrinkage radii and the grid sampling step size, affects both the accuracy of the fit between model and data at medium to low resolution and the speed of the calculations.Since accounting for the bulk solvent typically occurs in crystallographic calculations that involve atomic model and reflection data, from simple operations like Rfactor or map calculation to complex procedures like model refinement and building, the computational efficiency of this step is critical.The parameters governing the speed and accuracy of the flat bulk-solvent model are the solvent radius r solv , the mask shrinkage radius r shrink and the grid step d grid for the mask sampling.When this model was introduced (Jiang & Bru ¨nger, 1994) the choice of values for these parameters, of r solv = 1.0A ˚, r shrink = 1.1 A ˚and d grid = (d min /4) A ˚, was based on only two study cases at medium resolution (around 2 A ˚).A decade later, this choice was revisited for low-resolution cases by Rees et al. (2005), resulting in the suggestion that much The variation in �R (4) calculated for the test set of 2077 models when moving from the conventional set of mask parameters to the selected set r 0 solv ; r 0 shrink ; d 0 grid of optimal values.Each point corresponds to an individual model.The plots show the distribution of the variation in �R (4) (vertical) versus (left) the variation �R in the overall R factor calculated for the whole set of reflections and (right) the R (4) value itself.For the model of 3b6a (indicated by red arrows) the values (0.64; 0.92) become (0.26; 0.36) after the removal of isolated small-volume regions appearing in the new mask.
finer steps are needed at these resolutions.While these finer steps, calculated as a fraction of d min , address the problem of accuracy for low-resolution models, they in turn create the problem of substantially increasing the computational time for higher-resolution cases.A universal resolution-independent choice of mask calculation parameters is therefore highly desirable, and we here show that this is possible.A systematic study across hundreds of models from the PDB performed in this work reveals an optimal choice of these parameters to be r 0 solv = 1.1 A ˚, r 0 shrink = 0.9 A ˚and d 0 grid = 0.6 A ˚. Validation of this choice with a much larger test set of models shows that these values are broadly applicable.In the last stages of refinement, a finer grid with a step d grid = 0.4 A ˚and radii r solv = 1.1 A ˚, r shrink = 0.8 A ˚may possibly be used to improve the results further.The parameters described here are implemented in CCTBX and are used in the Phenix suite (Liebschner et al., 2019), where applicable, starting from Version 1.20rc4-4425.

Discrete Fourier transform and Fourier coefficients
For a periodic function f(x) of a single variable with period a = 1, the integral Fourier transform results in an infinite number of Fourier coefficients F(h), where h is an integer.The infinite Fourier series defined by these coefficients converges to the function f(x).The sharper the function, the faster the convergence is (we do not specify the formal, mathematically strict, conditions of convergence when studying smooth functions like density distributions).When such a function is sampled on a regular grid with N points per interval, using the obvious equation for integer m gives the convergent Fourier series on the grid nodes x n = n/N which can be expressed as Here H is any integer number, and convergence of the original Fourier series proves convergence of each internal series in the right-hand expression of (12).In other words, given N real numbers on a regular grid, the discrete Fourier transform results in a set of values which possess Hermitian symmetry.There are only N/2 independent values since, according to their definition in (13), F grid (h) are periodic and so F grid (h) = F grid (h + N) for every h.Usually, F grid (h) taken with the consecutive indices are used as approximations to F(h).In some applications, the range 0 � h < N can be chosen.However, since the Fourier coefficients F(h) for convergent series generally decrease with |h|, it is more practical to choose À N/2 < h � N/2, which results in a smaller |F grid (h) À F(h)| difference.

Figure 1
Figure 1The variation � � R ð4Þ n in the � R ð4Þ n factor, as defined in the text, as a function of the grid step d grid .Each data point is the average of � � R ð4Þ n across all structures.Intervals of 1� are given for each grid step.

ð4ÞnFigure 2
Figure 2The distribution P(r solv , r shrink , d grid ) � 10 2 of the variation in R ð4Þ n with respect to the best value � R ð4Þ n for all combinations of the mask parameters.The values on the axes are r solv � 10 (horizontal) and r shrink � 10 (vertical).High P values correspond to the (r solv , r shrink , d grid ) combinations giving R ð4Þ n close to the minimum best value of � R ð4Þ n .The colour scheme indicates the P-function value ranges: red (P < 0.01), yellow (0.01 � P < 0.2), light green (0.2 � P < 0.9) and dark green (P � 0.9).